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Correlations in the signal observed via functional Magnetic Resonance Imaging (fMRI) , are expected to reveal the inter- 
actions in the underlying neural populations through hemodynamic response. In particular, they highlight distributed 
set of mutually correlated regions that correspond to brain networks related to different cognitive functions. Yet graph- 
theoretical studies of neural connections give a different picture: that of a highly integrated system with small- world 
properties: local clustering but with short pathways across the complete structure. We examine the conditional indepen- 
dence properties of the fMRI signal, i.e. its Markov structure^ to find realistic assumptions on the connectivity structure 
that are required to explain the observed functional connectivity. In particular we seek a decomposition of the Markov 
structure into segregated functional networks using decomposable graphs: a set of strongly-connected and partially over- 

I lapping cliques. We introduce a new method to efficiently extract such cliques on a large, strongly-connected graph. We 

'compare methods learning different graph structures from functional connectivity by testing the goodness of fit of the 
model they learn on new data. We find that summarizing the structure as strongly-connected networks can give a good 
description only for very large and overlapping networks. These results highlight that Markov models are good tools to 

.identify the structure of brain connectivity from fMRI signals, but for this purpose they must reflect the small- world 

'properties of the underlying neural systems. 
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1. Introduction 

The study of distant correlations in functional Mag- 
netic Resonance Imaging (fMRI) signals, revealing brain 
functional connectivity^ has opened the door to fundamen- 
■tal insights on the functional architecture of the brain 
'[1, 2]. Among other features, the concept of cognitive net- 
work has emerged as one of the leading views in current 
cognitive neuroscientific studies: regions that activate to- 
gether form an integrated network corresponding to a cog- 
nitive function [1]. In other words, these networks can in 
general be identified to sets of active regions observed in 
traditional brain mapping experiments, thus in line with a 
segregative view of brain functional organization. On the 
other hand, graph-theoretical analysis of brain connectiv- 
ity has shown that it displays small- world properties: any 
two points of the brain can be connected through few inter- 
mediate steps, despite the fact that most nodes maintain 
only a few direct connections [2]. There is an apparent 
contradiction between those two points of views: can the 
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brain be separated in functionally-specialized networks, or 
are these simply a convenient but misleading representa- 
tion to interpret the data? 

Establishing a link between the functional connectivity 
observed in fMRI and the underlying neural architecture 
can be a challenging task. Indeed, the fMRI signal is very 
noisy and reflects many non neural effects, such as cardiac 
and respiratory rhythms or head and body motion. In 
addition, statistical estimation of brain interactions from 
fMRI data is made difficult by the small number of obser- 
vations available per experimental run, and by the lack of 
salient structure in this data. Finally, it has become a com- 
mon practice to study these interactions during resting- 
state^ i.e. in the absence of any stimulation, the subject 
resting eyes-closed in the scanner; in that case functional 
connectivity is loosely identified with the correlation struc- 
ture of ongoing activity across regions. 

In this paper, we use advanced statistical techniques 
to investigate the structure of brain connections that is 
captured by resting-state fMRI recordings. For this pur- 
pose, we infer the Markov network structure of the signals 
-the underlying independence graph- using a statistical 
framework well suited to modeling correlations: Gaussian 
graphical models. We compare various estimation meth- 
ods adapted to different graph topologies. In particular. 
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we introduce decomposable Markov models to describe 
functional networks: sets of distant regions working to- 
gether. Importantly these tools enable to study the extent 
-i.e. the clique size- of brain functional networks that is 
required to give a good description of the signal. Our work 
introduces a probabilistic description of brain activity that 
encompasses features of distributed cognitive network and 
small- world systems. 

The layout of the article is the following. First we 
review different pictures of brain connectivity emerging 
from a vast body of anatomical and structural studies. In 
section 3, we introduce the statistical tools that we use 
to estimate functional connectivity graphs from the fMRI 
signal. In section 4, a method for learning decomposable 
models of brain activity is presented. Finally, in section 5, 
we estimate graphical models of brain activity with vari- 
ous methods and quantify their ability to fit unseen data 
as a function of the imposed graph topology: a strong 
segregation into well-separated distributed networks, or a 
highly-connected system. 

2. Brain structure: from small- world connectivity 
to functional networks 

Integration and segregation in neural systems. Beyond the 
historical opposition of localizationist and holist views [3] , 
brain organization is generally considered as reflecting the 
fundamental principles of segregation into functionally- 
specialized sub-systems, and integration across these sys- 
tems to sustain high-level cognitive functions [4]. Origi- 
nally highlighted in theoretical descriptions of neural sys- 
tems [5] , this general organization shapes the properties of 
brain-wide connectivity: local clustering and global con- 
nectivity [6, 2] that reflects functional integration [7]. In 
high-level cognition, it can been seen as the duality of local 
circuits and global networks [8], for instance in conscious 
information integration [9]. Finally, there is growing ev- 
idence from functional imaging, that this segregation in- 
creases during brain development [10, 11]. 

Small word graphs. A quantitative description of these 
structural properties can be given using graph theory. In 
mathematics and computer science, a graph is the abstract 
object defined by a set of nodes and their connections. In 
social sciences or neuroscience, the word "network" is often 
employed instead of "graph" , but in this paper, we will re- 
serve this term to denote cognitive or functional networks. 
Watts and Strogatz [12] popularized the concept of small- 
world properties for graphs: graphs with a small fraction 
of nodes connected, but with a connection topology such 
that only a short chain is required to connect any pair 
of nodes. These efficient transport properties are achieved 
via a local clustering into communities^ and a few edges in- 
terconnecting the communities across the graph. A graph 
can be characterized by a variety of statistical properties of 
its connections [13, 14, 2]. Most often, small- world graphs 
are studied via the relative importance of a local graph 
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Figure 1: Intrinsic functional networks extracted from fMRI 
datasets of on-going activity, adapted from [28]. (a) Default mode 
network [29]. (b) Language network. 

metric, such as their clustering coefficient, and a global 
metric, e.g. their average path length [12, 13, 15]. The 
fundamental characteristic of a small- world graph remains 
that it displays high sparsity and yet good transport across 
the graph. Watts and Strogatz [12] stress that, for sparse 
graphs, small- worldness is determined by its global prop- 
erties: '^at the local level (as reflected by [the clustering 
coefficient]), the transition to a small world is almost un- 
detectable^' while '%n sparse networks with many vertices 
[...] even a tiny fraction of short cuts would suffice^'. 

Graphs of brain connectivity. The physical connections 
between gray-matter areas can be measured in-vivo via 
tractography of diffusion MRl [16, 17]. They establish a 
picture of the human brain architecture at an anatomical- 
connectivity level [18, 19]. In non- human animals, system- 
atic post-mortem studies have enabled detailed and com- 
plete mapping of connections, e.g. for the macaque brain 
[20, 21]. This anatomical connectivity forms the substrate 
of information processing in the brain and reflects the func- 
tional segregation properties of neural systems [22]. From 
the point of view of graph theory, it has been shown to dis- 
play small- world properties [6, 2]. In fact. Watts and Stro- 
gatz [12] used the graph of neural connections of C. elegans 
in their seminal paper introducing small- world graphs. It 
is also possible to build graphs of functional connectivity 
by thresholding correlation matrices of brain activity, for 
instance from MEG [23] or fMRI data [24, 25, 26, 27]. As 
their anatomical counterparts, these have been shown to 
display small-world properties: local clustering and short 
average path length. 

Intrinsic functional networks. The seminal work of Biswal 
et al. [30] showed that correlations in resting-state brain 
activity, i.e. in the absence of task, can be used to re- 
veal brain systems dedicated to a cognitive function, often 
called intrinsic functional networks. Indeed, in the pres- 
ence or in the absence of explicit cognitive task, a large 
fraction of brain activity is due to on- going functional pro- 
cesses. After Shulman et al. [31] used task-independent 
spatial correlations maps to uncover a new and important 



2 



functional network [29] , analysis of fluctuations in on-going 
activity has been used to systematically map the large- 
scale intrinsic functional architecture of the brain [1, 32]. 
Data-driven extraction of the main functional networks 
from resting-state data has been shown to be stable across 
subjects [33, 28]. These networks form a relevant descrip- 
tion of the brain: they are found alike in on-going activity 
and in task-driven experiments, and stand out as land- 
mark structures in meta-analyses carried out on current 
functional protocols [34]; they provide relevant descrip- 
tors for large scale pathologies such as neurodegenerative 
diseases [35]; and they reflect the underlying anatomical 
connectivity [36, 37]. An important aspect of these net- 
works is that many correspond directly to set of regions 
that are specifically recruited for certain cognitive pro- 
cesses, including higher-order functions, such as language 
(Fig. l.b) or executive function [38]. This decomposition 
of on-going activity offers a insightful view on the func- 
tional architecture of the brain, that is the interplay of a 
set of weakly-overlapping, functionally-specialized, brain 
networks: ''the human brain is intrinsically organized into 
dynamic, anticorrelated functional networks" [39]. 

Criticism of small-world analyses. The analysis of brain 
architecture in terms of its graphical properties or its in- 
trinsic functional networks gives differing points of view: 
on the one hand the distance on the connectivity graph 
between any two nodes of the small-world brain is small; 
on the other hand the brain appears as decomposed in 
functionally-specialized and weakly-overlapping networks 
of regions. While these two descriptions are not mutually 
exclusive, small- world analyses have drawn some criticism, 
loannides [40] criticizes the random-graph description of 
brain networks, as it is based on a small number of graph 
metrics that give a fairly unspecific characterization of the 
brain. He argues for a ''hierarchy of networks rather than 
the single, binary networks that are currently in vogue". 
More fundamentally, loannides raises the issue that these 
properties can easily be confounded by observation noise: 
"[in M/EEG] the activation of few (or many) uncorrected 
generators will [...] produce a small-world topology in a 
network computed from the raw signal topography" [40]. 
Indeed, Deuker et al. [41] find an inter-class correlation of 
less than 0.5 on the small- worldness index in a test-retest 
study using MEG for 2.5 minutes^ of resting-state acqui- 
sition. In addition, simulation-based studies show that an 
observed small- world structure for a graph empirically de- 
rived from time-series may solely be explained by sampling 
effects [42, 43]. 

It is thus legitimate to ask whether the small-world 
properties of functional connectivity are indeed necessary 
to give a good description brain activity, or whether a 
picture of brain function in terms of separated cognitive 
networks is sufficient in view of the data. 



^2.5 minutes of acquisition time may seem short, yet for MEG, 
it provides orders of magnitude more observations than a standard 
fMRI experimental run 



3. Statistical tools for connectivity analysis: 
Markov and Gaussian graphical models 

Notations. To clarify the mathematical presentation, we 
apply the following conventions. We write sets with cap- 
ital letters, matrices with capital bold letters, A e R"^^^, 
and we denote || A|| the operator norm, || Ap = T.7,j=i ^Ij^ 
and II A 111 the £i norm, ||A||i = T.7,j=i ^ iden- 

tity matrix. Quantities estimated from the data at hand 
are written A. A~^ is the matrix inverse of A, A^ is the 
transposed matrix. Finally, we write A > 0, for a sym- 
metric matrix A with strictly positive eigenvalues, a 
positive definite matrix. 

3.1. Probabilistic modeling of brain network structure 

A limitation of the commonly used graph-theoretical 
descriptions of functional connectivity is that they do not 
form a probabilistic model of the functional signal. As 
such, they convey no natural notion of goodness of fit and 
hypothesis testing of the graph properties in the presence 
of noise is ill-defined. This is why we resort to the analysis 
of graphical models specifying a probability distribution 
for the signal. Namely we use Markov models, that we 
apply to the functional-connectivity correlation matrices 
in the framework of Gaussian graphical models. 

Markov models: conditional independence graphs. Mea- 
sures of functional connectivity are based on correlations 
in the BOLD signal. They cannot be easily interpreted in 
terms of transfer of information or effective connectivity, 
that is the influence that a neural system exerts over an- 
other [44]. Indeed, BOLD is a very indirect measure of 
neural firing, the MRI signal observed contains many non 
BOLD-related confounds, and correlation is a rough sum- 
mary of the dependence between two variables revealing, 
amongst other things, indirect effects. 

For these reasons, we focus our study on recovering the 
independence structure of the observed functional signals. 
With the data at hand, fMRI recordings with limited ob- 
servations available, this is a challenging statistical estima- 
tion problem. We use a class of probabilistic models called 
Markov models that specify the independence structure of 
the variables that it describes. It can be represented as 
a graph, in which case a node is statistically independent 
from nodes to which it is not directly connected, condi- 
tionally on its neighbors. 

In the context of brain functional connectivity, we are 
interested in inferring the graphical structure from the cor- 
relations in the observed activity. We frame this problem 
as estimating the Markov structure^ of a Gaussian graph- 
ical model of brain activity. 



^More specifically, we learn a undirected graph of conditional in- 
dependence. Such a model is often called a Markov Random Field 
model. 
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Gaussian graphical models. The study of functional con- 
nectivity focuses on second-order statistics of the signal, 
in other word Gaussian statistics. While the underlying 
signals are most-likely not Gaussian, these measures have 
been shown to capture well the independence structure 
underlying fMRI signals [45]. A multivariate Gaussian 
model with a specified Markov structure is called a Gaus- 
sian graphical model [46]. We apply this class of models 
to the study of brain connectivity. 

A centered Gaussian model is fully specified by the 
inverse of its covariance matrix, known as its precision 
matrix that closely relates to conditional correlations 
between its variables. The conditional independence be- 
tween variables in the model are given by the zeros in this 
matrix. Therefore the statistical estimation task that we 
are interested in, is to find these zeros in the precision 
matrix. 

We start from a brain activation dataset X e W^^ with 
p variables, the activation of p different brain regions, and 
n samples. We are interested in inferring a large-scale 
connectivity graph, that is between many nodes, from a 
comparatively small amount of observations. In these set- 
tings, estimating the covariance matrix, or the precision 
matrix, from the data is an ill-posed problem (see for in- 
stance [47, 48]). First, if n < ^p{p +1), the number of 
unknown parameters is greater than the number of sam- 
ples. Second, the various parameters to estimate are not 
mutually independent, as the estimated covariance has the 
constraint of being positive definite. 

Thus, the Markov structure of the data cannot be es- 
timated simply by thresholding the precision matrix. We 
resort to modern covariance estimation tools, namely co- 
variance selection, as detailed below. In this context, spec- 
ifying a Markov structure acts as a regularization for the 
covariance matrix estimation: it injects a prior informa- 
tion that a certain number of coefficients of the precision 
matrix are zero, and thus decreases the number of coeffi- 
cients to be estimated. 

Smith et al. [45] have shown on realistic simulations 
of brain functional connectivity that sparse ^i-penalized 
inverse covariance estimators performed well to recover the 
graphical structure from noisy data. Huang et al. [49] 
and Varoquaux et al. [7] used such a sparse covariance 
estimation procedure to infer the conditional independence 
structure of a Gaussian graphical model on full brain data. 

Assessing model fit. As a Gaussian graphical model de- 
fines a probability of observing data, its log-likelihood can 
be used as a natural metric for assessing the goodness of fit 
of the model for some test data X: for a model specified 
by its precision matrix 

C{X\K) = i (log det K-tiK Eemp.) + est (1) 
where Semp is the empirical covariance of the data: X)emp = 

n 



For our application, an important point is to sepa- 
rate properties of the functional connectivity networks ex- 
tracted from the signal from features learned on observa- 
tion noise particular to the experimental run. This prob- 
lem is well known in statistics under the term of overfit. 
We use a standard technique to control overfit known as 
cross-validation: having learned a probabilistic model on 
a given set of experiment run, we test its goodness of fit 
on unseen data. Features learned by chance do not gener- 
alize to unseen data, as it is independent from the training 
data. 

3.2. Covariance selection procedures 

As proposed by Dempster [50] , learning or setting con- 
ditional independence between variables can be used to 
improve the conditioning of an estimation problem, a tech- 
nique referred to as covariance selection. Improved covari- 
ance estimation can therefore be achieved by estimating 
from the data a precision matrix with a sparse support, 
i.e., a small number of non-zero coefficients. 

Selecting the non-zero coefficients to optimize the like- 
hhood of the model given the data (eq. 1) is a difficult 
combinatorial optimization problem. It is NP hard in the 
number of edges and it therefore cannot be solved on mod- 
erately large graphs. To tackle such problems efficiently, 
two strategies coexist: convex relaxation approaches that 
lead to optimal solutions for a new problem that is a con- 
vex approximation of the initial problem and greedy ap- 
proaches that find sub-optimal solutions. 

Sparse penalized inverse covariance estimation. A con- 
straint based on the number of non-zero coefficients leads 
to NP hard problems. It can be modified by replacing it 
by a constraint on the ii norm of the solution. This leads 
to a penalized estimator taking the form of the following 
convex optimization problem [51]: 

ki = argmax( log det K-tiK Eemp - A || K || i ) , (2) 

where || • || i is the element- wise £i norm of the off-diagonal 
coefficients in the matrix. This problem can be solved very 
efficiently in O(p^) time [52, 53]. This technique makes 
no assumption on the topology of the graph other than its 
sparsity. Its major limitation is that the ii penalty may 
bias the coefficients of the precision matrix. It was em- 
ployed successful to learn functional connectivity graphs 
by Huang et al. [49] and Varoquaux et al. [7] on full-brain 
fMRI data, as well as Smith et al. [45] on realistic simula- 
tions of neural interactions. 

PC-DAG. A greedy approach to learning the Markov in- 
dependence structure of observed data is given by the PC 
algorithm [54]. It consists in pruning edges of a graph 
by successively testing for conditional independence. Al- 
though it does not solve a global optimization criteria, this 
algorithm is proved to be consistent [54, 55]. In the case 
of covariance selection, it can be used efficiently to learn 
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a sparse precision matrix, as in the PC-DAG algorithm 
[56]: the PC algorithm is applied to estimate a Markov 
structure for the Gaussian graphical model, which is then 
used to estimate the precision matrix. An important point 
is that the Markov independence structure estimated is 
constructed as the moral graph^ of the (possibly) directed 
conditional independence relation extracted by the PC al- 
gorithm. This graph is limited by the PC algorithm to 
a small node degree. This implies that the graphs ex- 
tracted by the PC-DAG are unlikely to contain large cy- 
cles. This procedure is interesting because its computa- 
tional cost scales exponentially with the maximum num- 
ber of neighbors on the underlying graph, rather than the 
number of possible edges. 

Shrinkage estimates. Finally, a more naive but quite effi- 
cient approach in practice consists in regularizing the esti- 
mate of the precision matrix by adding a diagonal matrix 
to the empirical covariance before computing its inverse. 
It amounts to an £2 shrinkage by penalizing uniformly off- 
diagonal terms: 

k2 = (±em^+XI)-^ (3) 

Ledoit and Wolf [47] have introduced a closed formula for 
a choice of A leading to a good bias- variance compromise. 
Empirical results show that this method can achieve good 
generalization scores even with small sample sizes. The 
intuition behind this good performance is that a sample 
covariance matrix tends to over-estimate pair-wise corre- 
lations in the small-sample limit^. Shrinking them to zero 
thus improves the estimate. This procedure gives a good 
baseline to compare goodness of fit of models, however, it 
does not extract a graph structure from the observed time 
series, as it does not set zeros in the precision matrix. 

3.3. Decomposable graphical models 

Here, we introduce the notion of decomposable graphs^ 
and present some related properties interesting to describe 
brain functional networks. 

A graph is said to be chordal if each of its cycles of 
four or more nodes has a chord, i.e., an edge joining two 
nodes that are not adjacent in the cycle. A chordal graph 
can be divided in an ordered sequence of cliques^ Ci [46], 
as illustrated in Fig. 2. The intersection between two suc- 
cessive cliques Ci and Ci+i forms a complete subgraph^ 



moral graph of a directed graph is the equivalent undirected 
graph in terms of non-directed conditional independence relations 
[46] 

^Indeed the plug-in estimator for pair-wise correlations, 
r ^ {x,y)/(\\x\\\\y\\) , which also corresponds to the scaling of off- 
diagonal coefficients in the empirical covariance, is known to be bi- 
ased at small samples. In particular, this bias implies that the most 
probably observed correlation is smaller than the population value, 
as discussed in [57] page 520. 

clique is a fully-connected sub-graph: each nodes is connected 
to every other node. 

^A complete graph is a graph in which all pairs of nodes are 
connected. 




Figure 2: Structure of a decomposable graphical model. Left: graph 
representation. Right: corresponding precision matrix. Nodes can 
be ordered in such a way that they form a succession of cliques Cj, 
that are interconnected solely by separating sets S^. 

called a separating set Si . When the structure of a graph- 
ical probabilistic model is chordal, the model is said to be 
decomposable [58]. The different cliques Ci are then mu- 
tually independent conditionally on the separating set 5^, 
and the likelihood factors in terms corresponding to the 
different cHques and separating sets: 

C{X\K) = Y.^(^c.\KcJ-Z^(^s.\KsJ. (4) 

Si 

where Xd denotes the dataset reduced to the variables 
of clique Ci and Kd denotes the corresponding reduced 
precision matrix [46] . While there exists no closed formula 
for the Maximum Likelihood Estimate (MLE) of the pre- 
cision matrix for an arbitrary sparse graphical Gaussian 
model, the MLE for a decomposable model can be written 
as: 

KMLE = nKcA°-Z[Ks,f, (5) 

Ci Si 

where [^c^]^ denotes the MLE for the precision matrix 
of the fully-connected subset of X reduced to the cHque 
Ci, that is the inverse empirical covariance matrix, padded 
with zeros to the size of the full model [46] . Identifying the 
cliques and separating sets in a graphical model opens the 
door to an efficient estimation of the maximum-likelihood 
precision matrix. Note that a non-decomposable model 
can be turned in a decomposable model: cordless cliques 
can be completed to turn the graph in a chordal graph. 

The motivation for using decomposable models to learn 
brain connectivity is twofold. First, the covariance matri- 
ces are estimated at the clique level, which are smaller 
problems. This strongly limits the estimation error. More 
importantly, the description of the brain that decompos- 
able models give, cliques of regions that are mutually inde- 
pendent conditionally to set of hubs nodes, seems to match 
well the intuition of the functional networks composing 
brain architecture. Models sharing a similar intuition but 
not relying on a graphical description have been popular- 
ized in neuroimaging through the concept of integration of 
brain regions within a network [59]. 

4. Efficient learning of decomposable structures 

In this section, we introduce a new algorithm to learn 
decomposable models on large covariance matrices with- 
out a priori knowledge of the relevant cliques. Indeed, to 
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date, none of the standard covariance selection procedures 
for large-scale graphs seeks a decomposable model. While 
Bayesian approaches to learning Gaussian graphical mod- 
els often rely on decomposable models, they require sam- 
pHng of the space of models, and therefore are intractable 
on problems with more than a few dozen of nodes [60, 61]. 
Marlin and Murphy [62] recently proposed an algorithm 
to impose a clique structure on Gaussian graphical mod- 
els. However, their approach, with an O(p^) complexity, 
requires a known ordering of the nodes. 

We propose a fast algorithm, that we call FastDecomp, 
to learn a decomposable graph with a large-scale structure 
common to several datasets. Our goal is to find a structure 
of conditional independence between successive blocks of 
variables. Our algorithm is based on a greedy approach to 
select non-independent variables. 

The FastDecomp algorithm. Similarly to the PC- 
Algorithm, we start from a complete graph ^, compute 
pair-wise partial correlations conditioning on all other 
variables, and use them to test conditional independence 
between variable pairs to prune the graph. However, to 
keep bounds on the computational cost, we do not pursue 
to test conditional independence exhaustively. To test for 
independence between nodes Xi and Xj, conditioning 
on Xk^K = {1 . . j}, we first test conditional 

independence using Fisher's z-transform of the estimated 
partial correlation^: 



z{i,j\K) 



ilogfi^^V 

2 \ 1 - Pi,j\K I 



(6) 



z{i^j\K) has approximately a A/'(0, (n - p - distri- 
bution if pij\K = (see Riitimann and Biihlmann [56] for 
similar considerations). 

Once all the pair- wise partial correlations are tested, we 
find a good completion of the resulting graph to a chordal 
graph^. For this we apply the symmetric Reverse Cuthill- 
McKee (RCM) algorithm [63] on the corresponding adja- 
cency matrix to obtain a peripheral node^ and an order- 
ing minimizing the envelop of the graph. We start from 
the peripheral node and jump successively to the adjacent 
node furthest according to the RCM ordering, which gives 
us an enumeration of the cliques Ci of the chordal graph, 
and the separating sets Si. 

Finally, on each clique Ci (resp. separating set Si) we 
estimate the precision matrix by applying the Ledoit-Wolf 



Algorithm 1 FastDecomp. RCM stands for Reverse 
Cuthill-McKee algorithm [63]. LedoitWolf stands for co- 
variance estimation as in Eq. 3, [47]. 
Input: X, threshold /3 

Output: estimated precision matrix perfect elimi- 
nation order 

Compute Klw ^ (LedoitWolf(X))"^ 
Initialize ^ to a complete graph, 
for z = 1 to p do 
for j = 1 to p do 

Here we use 

(^Lw)i,j 



^J (XLw)i,i (^Lw)j,j 



if {z{ij\{l..p}\{ij})</3then 

Remove edge z, j from Q. 
end if 
end for 
end for 

io, order = RCM{g) 

Color the vertices of Q with order 

Initialize K to 8i p x p matrix of zeros. 

Set X ^ [X\...,X]. 

Sort X according to order. 

repeat 

ii ^ largest node in adj{io) given order. 

k ^ K + [(LedoitWolf(Xio...ij)"Y 
if ii^ p then 

iQ ^ smallest node in adj{ii + 1) given order. 

k ^ K - [(LedoitWolf(Xio...ij)"Y 
end if 

until ii = p 



estimator to the observed data, restricted to the clique 
(resp. separating set). We obtain the final precision ma- 
trix using formula 5. See listing 1 for a precise description 
of the algorithm. 

Computational cost. The Reverse Cuthill-McKee can be 
implemented in O(p^) [65]. The bottleneck of our algo- 
rithm is thus the last loop: calculating the precision on 
each clique and each separating set. The cost of this loop 
as bounded by p times a matrix inversion. The FastDe- 
comp algorithm is thus bounded by 0(p x p^). 



^We use the Ledoit-Wolf estimator to estimate the partial corre- 
lation, as it gives a good bias-variance compromise. 

^Finding the best possible chordal completion is know to be NP- 
hard [64]. 

peripheral node of a graph is a node for which the shortest 
path to another node is the diameter of the graph, that is the max- 
imum shortest path between all nodes. The reverse Cuthill-McKee 
algorithm does not directly output a peripheral node, but we use 
its graph traversal and dynamical programming to find a pseudo 
peripheral node for little added cost. 



5. Experimental results: graphical models of on- 
going brain activity 

5.1. A resting- state fMRI dataset 

We apply the various covariance-selection methods 
to estimate the Markov independence structure from a 
resting-state fMRI dataset acquired in a previous experi- 
ment [66]. 12 subjects were scanned in a resting task, re- 
sulting in a set of 810 brain volumes per subject acquired 



6 




Figure 3: Empirical covariance and associated empirical precision for 
one subject. On the precision matrix, the values on the diagonal are 
not represented. 

with a repetition of 1.5 second. The data were prepro- 
cessed using the SPM5 software^^, including in particu- 
lar the spatial normalization to standard template space. 
As in Achard et al. [25], or Huang et al. [49], we extract 
brain-activation time series on a parcellation of the gray 
matter in 105 non-overlapping regions^^. As we are inter- 
ested in modeling only gray-matter correlations, we regress 
out confound effects obtained by extracting signals in the 
white matter, the cortico-spinal fluid regions, as well as 
the rigid-body motion time courses estimated during data 
pre-processing. 

The correlation matrix of the resulting signals can be 
seen in Figure 3 (left). On this figure, regions are ordered 
according to the labels of the atlas that was used to define 
them [67]. This ordering groups together regions that be- 
long to the same general bilateral anatomical structures. 
A first visual analysis reveals large blocks of correlated 
regions. This can be interpreted as the signature of the 
so-called cognitive networks. 

The corresponding empirical precision matrix (Figure 
3, right) reveals the partial correlations between regions. 
It appears more sparse; in particular the block-structure 
is not visible. 

5.2. Generalization performance of the various estimators 

We learn from this dataset different Gaussian graphi- 
cal models of brain activity using the different covariance 
selection procedures: sparse penalized inverse covariance 
estimation^^, PC-DAG, and FastDecomp. We evaluate the 
goodness of fit of the different models using a leave-one-out 
procedure: we learn a sparse precision on 11 subjects^^ and 
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■'^■'^The complete atlas comprises 110 regions, but as the field of view 
of our fMRI images is reduced, it does not cover the cerebellum. 

■"^^We use an implementation of the -^i-penalized inverse covariance 
estimator following Rothman et al. [68] 

■"^^We apply the estimators on the concatenated individual data: 
X = [X^, . . . , X'^], after detrending, band-pass filtering, and 
variance-normalizing the individual time series. The generative 
model underlying this concatenation corresponds to the hypothe- 
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Figure 4: Cross-validation scores of the sparse precision estimators 
compared as a function of the width of maximum clique that they 
yield. For the FastDecomp algorithm, this width is varied by setting 
the /3 parameter of the algorithm. 

test the likelihood of the data corresponding to the 12^^ 
subject in the model described by this precision matrix. In 
the likelihood term (eq. 1), we use correlation matrices, 
rather than covariances, as we are not interested in the 
variance terms, but only in the correlation structure. 

As a baseline for the generalization likelihood, we use 
the Ledoit-Wolf diagonal shrinkage estimator that does 
not impose any structure or sparsity on the model. Each 
algorithm has a parameter that controls the degree of spar- 
sity of the estimated graph: for the sparse penalized esti- 
mator it is the amount of ii penalization -A in equation 2- 
and for the greedy methods, PC-DAG and FastDecomp, it 
is the threshold on the conditional independence test used 
to build the Markov independence graph - /3 in listing 1. 
If we set these parameters to maximize the likelihood of 
left-out data, we find that the best performing solutions 
are achieved with very little sparsity or penalization. 

The PC-DAG approach is fast when underlying graphs 
are very sparse. However, to achieve good cross-validation 
scores on our data, the p- value on the conditional indepen- 
dence test of the PC algorithm should be very small. As a 
result, the algorithm explores denser graphs. As the com- 
putational cost is in ©(e^), where k is the maximum node 
degree of the graph, we could not reach an optimal point 
in reasonable time. We interpret this result by the fact 
that the brain-connectivity network is poorly represented 
by a network with small maximum node degree. 

5.3. Trading off goodness of fit for decomposition into net- 
works 

We are interested in interpreting the graphs in terms of 
functional networks. As the networks found can be over- 



sis that all the different observations are drawn from the same dis- 
tribution. It is a common assumption in unsupervised analysis of 
resting-state dataset, popularized with ICA [69]. 

■"^^We also tried setting the amount of shrinkage by nested cross- 
validation, rather than using the Ledoit-Wolf oracle. We found that 
it gave the same generalization score. 
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Figure 5: Cross-validation scores as a function of the filling factor of 
the graph: the ratio of edges to the total possible number of edges, 
p'^ . The dashed blue line gives the loss in likelihood explained by the 
loss in degrees of freedom of the model compared to the Ledoit-Wolf 
model as the filling factor of the precision matrix decreases. 

lapping, the total number of cliques in the decomposable 
graph is not a good indication of the amount of "decom- 
position" . For this purpose, we study the maximum clique 
size, which correspond to the maximum number of nodes 
in a functional network (Figure 4). With a small maxi- 
mum clique size, the likelihood of the model (Eq. 4) fac- 
tors in terms comprising a small amount of brain regions, 
and the decomposition into conditionally-independent net- 
works in more easily interpret able. Note that, in this case, 
the description of networks is underpinned by a proba- 
bilistic model in which they appear in independent terms, 
unlike when using ICA [69, 70, 71, 28], or seed-based cor- 
relation analysis [30, 72, 1]. 

Likelihood as a function of clique size. Decomposable 
models estimated using FastDecomp maintain a general- 
ization score (likelihood on left out data) as good as the 
baseline for maximum clique size as large as 70% of the 
number of nodes but for smaller cliques their score de- 
creases, and starts dropping quickly for cliques smaller 
than 30% (Figure 4). While the models estimated with 
the sparse penalized inverse covariance estimator or the 
PC-DAG algorithm are not decomposable, the graph can 
be completed into a chordal graph^^ using the RCM algo- 
rithm. Increasing the penalization to create sparse graphs 
with the penalized estimator leads to a important loss in 
generalization scores without a sizable clique width re- 
duction (Figure 4). Indeed, completing the graphs to be 
chordal creates large cliques unless they are very sparse. 
This high sparsity is enforced by using a large penaliza- 
tion, which introduces bias in the estimated coefficient 
and is detrimental to the generalization likelihood. The 
PC-DAG does not suffer from this limitation. However, 
we could not use it to explore graphs with a degree higher 
than 14 due to available computational power as the com- 
plexity of this algorithm is exponential in the maximum 



Figure 6: Graphical model of brain connectivity estimated by the 
PC-DAG algorithm. The graph estimated is very sparse, due to 
cost of the algorithm, exponential in the maximum node degree. 
However, it displays some clear structure that recalls the structure 
of small- world graphs described by [12]. First, neighboring nodes 
on the cortical surface are connected, in a 2D lattice-like structure. 
Second, connections outside the lattice structure create shortcuts 
in the graph: opposite homologous regions, as well as a few inter- 
hemispheric connections. 

degree. Although very sparse, these graphs already have 
large cliques. 

Selecting a decomposed model. As the maximum clique 
size decreases, so does the filling factor of the precision 
matrix. As a result, we are comparing models with dif- 
ferent degrees of freedom and we should account for the 
difference in our model comparison, based on a likelihood 
ratio test. The expected difference in log-likelihood, under 
the null hypothesis that two models fit as well the data, is 
given by half the difference in degrees of freedom. When 
accounting for this loss (Figure 5), we find that for fill- 
ing factors above 70%, which corresponds to a maximum 
clique width of 60%, the model learned by FastDecomp 
performs as well as the baseline non-sparse model. 

5.4' Structure of the models 

Here we discuss the structure of brain connectivity that 
appears in the graph learned. 

Very sparse models. Only very sparse graph structures are 
directly interpret able. As can be seen on figure 5, PC- 
DAG is the algorithm best suited for this purpose as, un- 
like penalized methods, it does not bias the estimation 
of the precision matrix coefficient, and, unlike the Fast- 
Decomp algorithm, it does not force small cliques. It is 
interesting to note that these graphs have large cliques 
(Fig 4), even though we were limited to exploring small 
maximal node degrees due to available computational 
power. Such graphs (Figure 6) have a lattice-like structure 
-nearest-neighbor connections on the cortex- with the ad- 
dition of inter-hemispheric homologous connections and a 
few long-distance connections. 



■"^^ Strictly speaking, this node degree corresponds to the underly- 
ing DAG computed by the PC-DAG algorithm, and not the final 
undirected graph. 



This procedure is often called triangularization of the graph. 



Decomposition in networks. To assert the meaning of the 
decomposable models in neuroscientific terms, we apply 
FastDecomp with P chosen such that the maximum clique 
width is 50% of the full graph. As highlighted by the above 
cross- vahdation study, this results in a small loss in gener- 
alization scores. However, the resulting cliques and sepa- 
rating sets form small groups of regions and are therefore 
more interpret able. The estimated banded precision ma- 
trix is given in Figure 7. The differences between empirical 
and estimated covariance accumulate mainly outside the 
cHques that were selected by FastDecomp, corresponding 
to the long-range interactions between networks far apart 
according to the ordering learned. In Figure 8, we dis- 
play the regions forming the nodes of the graph, colored 
by their node ordering on a standard brain model. 

It is clear that the ordering is not random: regions of 
similar function are grouped together. For instance, the 
purple nodes form the primary visual cortex and are con- 
nected to blue regions along the well-known dorsal visual 
pathway, going up to the superior parietal lobule [73] . The 
separation of brain structure from the outer surface versus 
cingular regions, located in between the hemispheres, is 
also striking, as well as the symmetry of the model. How- 
ever, more interesting is that the red regions, representing 
high node order, form a left-lateral network in the tempo- 
ral lobe. It resembles the language network, the lateralized 
network most often recruited that is displayed on figure 
lb, as estimated from the same resting-state dataset using 
an ICA-based method [28]. We note that the ordering of 
regions located in the central cliques is more difficult to 
interpret as it appears to be representing several cognitive 
networks. This is consistent with the fact that these re- 
gions are part of larger cliques, that do not separate small 
structures. We defer a more thorough neuroscientific in- 
terpretation of the present results, and in particular a link 
with the knowledge available from anatomical connectivity 
data, to future work. 

A high bandwidth structure. While the graphical struc- 
tures that we extract are sparse, we can see that they 
must have a small average path length to fit the data. In- 
deed, we find that the adjacency matrix of the graph must 
have a high bandwidth, as the likelihood drops when forc- 
ing a small bandwidth^^ as for small clique sizes on figure 
4. The graph is therefore a high-bandwidth graph [74], 
which implies it has a small diameter^^ [75]. This in turns 
implies that average path length on the graph is small. In 
layman's term, any two nodes are always connected by a 
small distance on the graph, as they can not be separated 
by more than a few large networks. 

Small word properties. The high bandwidth that we ob- 
served is characteristic of small- world graphs. Our analysis 
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^^The RCM is an algorithm commonly used for ordering a graph 
to evaluate its bandwidth. 

■"^^The diameter of a graph is the maximal value of the shortest 
path lengths for all pairs of nodes in the graph 
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Figure 7: Reordered empirical covariance and precision matrix, as 
well as estimates by FastDecomp, and the difference between em- 
pirical and estimated values. The parameter /3 was set to limit the 
size of the largest clique to 50% of the nodes. Compared to figure 3, 
the nodes have been reordered in a complete elimination ordering of 
the decomposable model. The outline of the cliques is drawn using a 
black line. Note that, while the empirical precision and the estimated 
precision seem to differ significantly due to the values set to zeros 
outside the cliques selected, the empirical and estimated covariance 
look very similar. The errors on the covariance matrix inside the 
cliques selected are not visible, as they are 1/100 times smaller than 
outside the cliques. 
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Figure 8: Node ordering in the decomposable model, represented on 
the region centers, in a view of the brain. The color of the nodes 
indicates the order in which they appear, in the decomposable model 
estimated in Fig. 
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shows that this property is required for a model to fit the 
data weh. The other aspect of a smah-world graph, lo- 
cal properties of the graph such as clustering coefficient, is 
sensitive to sampling noise or the choice of thresholding or 
any other sparsity-enforcing parameter. It appears that, 
with our dataset, little sparsity is best to maximize good- 
ness of fit^^. To study the local properties of the graph, we 
can use the results of the PC-DAG algorithm. Indeed, the 
chordal completion of the graph used in FastDecomp in- 
creases artificially the clustering coefficient, and the penal- 
ized estimators give precision matrices that are not sparse 
enough to be considered as small world^^. On the graph 
extracted by the PC-DAG with the best goodness of fit, 
we find an average shortest path length of 3.0 and an av- 
erage node clustering coefficient of 0.19. These numbers 
are characteristic of small word graphs [12] and are con- 
sistent with the graph properties reported with previous 
functional-connectivity analysis of fMRI that did not rely 
on Markov models [26]. 

We note that the PC-DAG algorithm is ill suited to re- 
cover small- world graphs, where the presence of hub nodes 
leads to high computational costs. On the opposite, Fast- 
Decomp is well suited for extracting large interpretable 
cliques on such data, but not the local properties. 

6. Conclusion 

We have applied different covariance selection proce- 
dures to learn probabilistic graphical models -Markov 
models- of brain functional connectivity from resting- 
state fMRI data. We introduce a definition of the large- 
scale functional networks as the cliques of a decomposable 
graphical model. To learn efficiently decomposable mod- 
els on graphs presenting hub nodes, such as small- world 
graphs, we have introduced a new algorithm. By setting 
the width of the largest clique of the graph, we have in- 
vestigated the compromise between models interpretable 
in terms of independent functional networks, and models 
that generalize well. 

We find that the brain is best represented by a high- 
bandwidth graph that cannot be decomposed into small 
cliques: forcing a strongly decomposable representation 
breaks its small- world properties. However, we can learn 
a simplified model of the data in which the variables are 



■"^^The small amount of sparsity may be due to spurious effects, 
such as inter-subject variability, or the definition of the regions used 
to extract the signal. We used a standard parcellation of the brain 
based on the anatomy of a single subject that is used routinely in 
small world analysis of fMRI - to list only a few [24, 25, 49]- although 
it has been shown that small- worldness, measured by the ratio of 
clustering coefficient and average path length, varied as a function 
of the brain parcellation used [76]. 

These precision matrices have many small coefficients, such that 
thresholding them would put give sparsity level adapted to studying 
small world properties, however there is no statistically-controlled 
way of thresholding, as it would also create non positive definite 
matrices, for which the likelihood of the model is not defined. 



decomposed into smaller conditionally-independent units 
that can be interpreted as functional networks, match- 
ing the current knowledge in cognitive neuroscience. To 
summarize adequately brain functional connectivity with 
segregated networks, these must encompass many brain 
regions and be strongly overlapping. These preliminary 
results provide further evidence that the functional con- 
nectivity signals observed in fMRI reflect the small-world 
properties of brain connectivity. Markov models, describ- 
ing the conditional independence relations of the data, are 
a promising tool to investigate the large scale architecture 
of the brain via its functional connectivity. 
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